Abstract

Although recent works try to improve collective communication
in grid systems by separating intra- and inter-cluster
communication, the optimisation effort focuses only on
inter-cluster communications. We believe, instead, that the
overall performance of the application may be improved if the
performance of intra-cluster collective communications is known
in advance. Hence, it is important to have an accurate model of
the intra-cluster collective communications, which provides the
evidence needed to tune them and to predict their performance
correctly. In this paper we present our experience in modelling
such communication strategies. We describe and compare different
implementation strategies with their communication models,
evaluating the models' accuracy and describing the practical
challenges that can be found when modelling collective
communications.

Keywords: collective communication, performance 
models, MPI 

1. Introduction 

The optimisation of collective communications in grids is a
complex task because the inherent heterogeneity of the network
limits the use of general solutions. To reduce this complexity,
most systems consider grids as interconnected islands of
homogeneous clusters. Although there are no restrictions on the
number of layers that connect those "islands", as successfully
demonstrated by 0, most systems only optimise communications at
the inter-cluster level, because wide-area networks are slower
than LANs. Some examples of this "two-layered" approach include
ECO ^1] and MagPIe jHlEnii, which apply this concept to
wide-area networks, and even LAM-MPI 7 0, which considers SMP
machines as islands of fast communication.

We believe that while inter-cluster optimisation is necessary to
achieve good performance in grid-like environments, it should
not be disconnected from the intra-cluster level. Actually, the
modelling and optimisation of intra-cluster communication is
especially important when the clusters are structured in
multiple layers. In this situation, the grid-aware tools must
deal with both communication and topology mapping, and a priori
knowledge of the intra-cluster communication may lead to larger
reductions of the overall execution time than a simple
minimisation of the wide-area communications.

Hence, in this paper we investigate how performance models can
be used to characterise the communication patterns of the
collective communications. These models can be used both to
predict the performance of these operations and to decide which
implementation technique is best suited to a specific set of
parameters (number of processes, message size, network
performance, etc.).

Consequently, to model collective communications we need a good
performance model. There are several performance models for
message-passing parallel programs, some of them widely known,
like BSP [20] or LogP 0. Although these two models are
equivalent in most circumstances 52], LogP is slightly more
general than BSP, as it does not require a global barrier to
separate communication and computation phases, and because it
adds the notion of a finite network capacity that can only
support a certain number of messages in transit at once. As a
consequence, we chose to use, in this paper, the parameterised
LogP model (pLogP) [10]. pLogP is an extension of the LogP model
that can accurately handle both small and large messages with
low complexity. Due to its simplicity, this model allows fast
prototyping of the communication performance, even though it has
difficulties representing contention situations. Nevertheless,
our pLogP models were able to predict the system performance
with enough accuracy in most cases presented in this paper,
allowing the selection of the implementation technique best
suited to a specific network environment.

To illustrate our approach, we present three examples, the
Broadcast, Scatter and All-to-All operations, which respectively
represent the "one-to-many", "personalised one-to-many" and
"many-to-many" collective communications. While conceptually
simple, Broadcast and Scatter operations have communication
patterns that can be found in many other operations, like
Barriers, Reduces and Gathers. The All-to-All operation,
instead, has a complex communication pattern, but is one of the
most important communication patterns for scientific
applications. Additionally, an All-to-All operation is subject
to serious problems with communication contention, representing
a real challenge to performance modelling.

The rest of this paper is organised as follows: Section 2
presents the definitions and the test environment we consider
throughout this paper. Sections 3, 4 and 5 present,
respectively, the communication models we developed for
Broadcast, Scatter and All-to-All, while comparing the
predictions from those models with experimental results.
Finally, Section 6 presents our conclusions, as well as the
future directions of the research.

2. System Model and Definitions 

In this paper we model collective communications using the
parameterised LogP model, or simply pLogP [10]. As pLogP
parameters depend on the message size, the model can be accurate
when dealing with both small and large messages. Further, the
paper that describes pLogP presents several communication models
for grid-aware collective communications, which served as a
guide to many of our own communication models.

Therefore, throughout this paper we use the same terminology as
pLogP's definition, such as g(m) for the gap of a message of
size m, L for the communication latency between two nodes, and P
for the number of nodes. In the case of message segmentation,
the segment size s of the message m is a multiple of the size of
the basic datatype to be transmitted, and it splits the initial
message m into k segments. Thus, g(s) represents the gap of a
segment of size s.
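
As an illustration of this terminology, the sketch below stores a set of pLogP parameters and evaluates g(m) and k. The class name `PLogP`, the linear interpolation between measured sizes and all numeric values are assumptions made for the example; they are not taken from the benchmark tool or from our measurements.

```python
from bisect import bisect_left
from dataclasses import dataclass
from math import ceil


@dataclass
class PLogP:
    """Hypothetical container for measured pLogP parameters (times in seconds)."""
    latency: float    # L, latency between two nodes
    procs: int        # P, number of nodes
    gap_table: dict   # measured g(m): message size in bytes -> gap

    def g(self, m: int) -> float:
        """Gap of a message of m bytes, interpolated linearly between samples."""
        sizes = sorted(self.gap_table)
        if m <= sizes[0]:
            return self.gap_table[sizes[0]]
        if m >= sizes[-1]:
            # Assumption: scale proportionally beyond the largest sample.
            return self.gap_table[sizes[-1]] * m / sizes[-1]
        hi = bisect_left(sizes, m)
        lo = hi - 1
        frac = (m - sizes[lo]) / (sizes[hi] - sizes[lo])
        g_lo, g_hi = self.gap_table[sizes[lo]], self.gap_table[sizes[hi]]
        return g_lo + frac * (g_hi - g_lo)


def segments(m: int, s: int) -> int:
    """Number of segments k when a message of m bytes is split into s-byte pieces."""
    return ceil(m / s)


# Illustrative values only.
net = PLogP(latency=1e-4, procs=16, gap_table={1: 1e-5, 1024: 1e-4, 65536: 6e-3})
```

With such a table, `net.g(s)` gives the gap of one segment and `segments(m, s)` gives k, which are the two quantities the segmented models need.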

The pLogP parameters used to feed our models were obtained with
the MPI LogP Benchmark tool using LAM-MPI 7.0.4 [12], and are
presented in Figure 1.

Figure 1: pLogP parameters for the icluster-2 network



The experiments to obtain pLogP parameters, as well as the
practical experiments, were conducted on the ID/HP icluster-2
from the ID laboratory Cluster Computing Centre. This cluster
contains 100 Itanium-2 (IA-64) machines (dual processor, 900
MHz, 3 GB) interconnected by a switched 100 Mbps Ethernet
network, running Red Hat Linux Advanced Server 2.1AS with kernel
2.4.18smp. The experiments consisted of 100 measurements for
each set of parameters (message size, number of processes), and
the values presented here are the average of these measurements.

3. One-to-Many: Broadcast 

With Broadcast, a single process, called root, sends the same
message of size m to all other (P - 1) processes. Classical
implementations of the Broadcast operation rely on d-ary trees
characterised by two parameters, d and h, where d is the maximum
number of successors a node can have, and h is the height of the
tree, the longest path from the root to any of the tree leaves.
While an optimal tree shape can be deduced from the network
parameters and from $d, h \in [1 \ldots P-1]$ for which
$\sum_{i=0}^{h} d^i \ge P$ is true, most MPI implementations
usually rely on two fixed shapes, the Flat Tree, for small
numbers of nodes, and the Binomial Tree.
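
To make the shape condition concrete, the following sketch computes the smallest height h that lets a tree with at most d successors per node reach P processes. It assumes the condition is that the sum of d^i for i from 0 to h must be at least P; the function names are hypothetical.

```python
def nodes_covered(d: int, h: int) -> int:
    """Maximum number of nodes in a tree of height h with at most d successors per node."""
    return sum(d ** i for i in range(h + 1))


def min_height(d: int, procs: int) -> int:
    """Smallest h such that sum_{i=0}^{h} d^i >= procs."""
    h = 0
    while nodes_covered(d, h) < procs:
        h += 1
    return h
```

For 16 processes, d = 15 gives the Flat Tree (h = 1) and d = 1 gives a Chain (h = 15); the shapes in between trade height against the number of successive sends per node.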

Because most MPI implementations rely only on Flat and Binomial
Broadcast, some techniques were developed to improve their
efficiency. It is usual to apply different strategies according
to the message size, for example the use of a rendezvous message
that prepares the receiver for an incoming large message, or the
use of non-blocking primitives to overlap communication and
computation. Unfortunately, such techniques bring only minimal
improvements to
the final performance, and their efficiency still depends mostly
on the network characteristics.

Table 1: Communication models for Broadcast

Strategy                   Communication Model
Flat Tree                  $(P-1) \times g(m) + L$
Flat Tree Rendezvous       $(P-1) \times g(m) + 2 \times g(1) + 3 \times L$
Segmented Flat Tree        $(P-1) \times (g(s) \times k) + L$
Chain                      $(P-1) \times (g(m) + L)$
Chain Rendezvous           $(P-1) \times (g(m) + 2 \times g(1) + 3 \times L)$
Seg. Chain (Pipeline)      $(P-1) \times (g(s) + L) + (g(s) \times (k-1))$
Binary Tree                $\le \lceil \log_2 P \rceil \times (2 \times g(m) + L)$
Binomial Tree              $\lfloor \log_2 P \rfloor \times g(m) + \lceil \log_2 P \rceil \times L$
Binomial Tree Rendezvous   $\lfloor \log_2 P \rfloor \times g(m) + \lceil \log_2 P \rceil \times (2 \times g(1) + 3 \times L)$
Seg. Binomial Tree         $\lfloor \log_2 P \rfloor \times g(s) \times k + \lceil \log_2 P \rceil \times L$

Another possibility, however, is to compose a Chain among the
processes, pipelining messages. This strategy benefits from the
use of message segmentation, presenting many advantages as
recent works indicate pniEHI. In a Segmented Chain Broadcast,
the transmission of messages in segments allows a node to
overlap the transmission of segment k and the reception of
segment k+1, reducing the overall gap time.

However, the size of the segments should be carefully chosen
according to the network environment. Indeed, segments that are
too small pay more for their headers than for their content,
while segments that are too large do not exploit the network
bandwidth sufficiently. The search for the segment size s that
minimises the communication time can be done using the
communication models presented in Table 1 and the network
parameters. An efficient method consists of searching through
all values of s such that $s = m/2^i$ for
$i \in [0 \ldots \log_2 m]$. To refine the search, we can also
apply heuristics like local hill-climbing, as proposed by
Kielmann et al. [10].

In our work we developed the communication models for some
current techniques and their "flavours", which are presented in
Table 1. Most of these variations are clearly expensive, while
others have only a "historical" interest. Hence, we chose for
the experiments in Section 3.1 two of the most efficient
techniques, the Binomial and the Segmented Chain Broadcasts, and
the simplest one, the Flat Tree Broadcast.

3.1. Practical Results

To evaluate the accuracy of our models, we measured the
completion time of the Flat, Binomial and Segmented Chain
Broadcasts in real experiments, and we compared these results
with the model predictions. Although the Flat Tree is not
adequate for a large number of processes, we included it because
its simplicity makes it a good baseline for evaluating other
algorithms that use more complex strategies. Hence, Figures 2, 3
and 4 present each strategy compared to its performance model's
predictions. Despite some performance variations found mostly in
the Segmented Chain and the Binomial Broadcast, we can observe
that the predictions follow the general behaviour of the real
experiments. Actually, as these variations are much less
pronounced in the case of the Flat Broadcast, we think that they
are related to communication delays in some machines, which are
further propagated by message forwarding, a characteristic
present only in the Binomial and Chain Broadcasts. As the Flat
Tree Broadcast contacts each node directly, variations in one
machine cannot be propagated to the others, resulting in more
accurate predictions, as observed in Figure 4.
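
The predictions discussed above come from closed-form expressions, so they are cheap to evaluate. As a sketch, the code below evaluates the Flat Tree, Binomial Tree and Segmented Chain models of Table 1 and searches the segment size over s = m/2^i. The linear `gap` function and the `LATENCY` constant are placeholder assumptions standing in for measured pLogP parameters, and the function names are hypothetical.

```python
from math import ceil, floor, log2


def gap(m: float) -> float:
    """Hypothetical g(m): fixed per-message cost plus per-byte cost (seconds)."""
    return 50e-6 + m * 0.08e-6   # illustrative, roughly a 100 Mbps link


LATENCY = 60e-6   # hypothetical L, in seconds


def flat_tree(P, m, g=gap, L=LATENCY):
    return (P - 1) * g(m) + L


def binomial_tree(P, m, g=gap, L=LATENCY):
    return floor(log2(P)) * g(m) + ceil(log2(P)) * L


def segmented_chain(P, m, s, g=gap, L=LATENCY):
    k = ceil(m / s)   # number of segments
    return (P - 1) * (g(s) + L) + g(s) * (k - 1)


def best_segment(P, m, g=gap, L=LATENCY):
    """Try s = m / 2^i for i in [0 .. log2 m] and keep the cheapest."""
    candidates = [m / 2 ** i for i in range(int(log2(m)) + 1)]
    return min(candidates, key=lambda s: segmented_chain(P, m, s, g, L))


def pick_strategy(P, m):
    """Return the name of the cheapest strategy and all predictions."""
    s = best_segment(P, m)
    predictions = {
        "flat": flat_tree(P, m),
        "binomial": binomial_tree(P, m),
        "segmented chain": segmented_chain(P, m, s),
    }
    return min(predictions, key=predictions.get), predictions
```

With these placeholder parameters, `pick_strategy(16, 1048576)` selects the Segmented Chain and `pick_strategy(16, 1)` selects the Binomial Tree; with measured parameters the crossover points would differ.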
Figure 2: Real and expected performance for the Segmented Chain
Broadcast



Figures 2, 3 and 4, however, are not on the same scale due to
the different performance level of each algorithm. To compare
these algorithms and to better observe the models' accuracy, we
present in Figure 5 the results obtained for a group of 16
machines. Here, we observe that the Segmented Chain Broadcast is
the best suited strategy for our cluster, even if the model's
predictions slightly underestimated the communication cost.
While the observed error rate does not interfere in the
selection process, our attention was drawn by the unexpected
delay presented by the Binomial Broadcast when messages are
small. A close look at small messages, as presented in Figure 6,
shows that not only the Binomial Broadcast was affected, but
also the Segmented Chain Broadcast. Although this variation does
not affect the choice of the best algorithm, we decided to
investigate it more closely.

Figure 3: Real and expected performance for the Binomial
Broadcast

Figure 4: Real and expected performance for the Flat Tree
Broadcast

Figure 5: Comparison between models and real results, for 16
machines

In fact, similar discrepancies were already observed by the
LAM-MPI team and, according to Loncaric [14], they can be due to
the TCP acknowledgement policy in some Linux versions. This
problem may delay the transmission of some small messages even
when the TCP_NODELAY socket option is active (actually, only one
in every n messages is delayed, with n varying from kernel to
kernel). It is true that these effects were mostly present in
Linux kernels 2.0.x and 2.2.x, but according to Loncaric [14],
"anecdotal evidence suggests that the improved TCP stack in
Linux 2.4 may have problems with many-to-many communication
patterns even though each point-to-point link performs fine".
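
For reference, TCP_NODELAY is the standard socket option that disables Nagle's algorithm, so that small writes are not held back to be coalesced. The sketch below shows how an application would enable it and read it back; it only illustrates the option itself and does not reproduce the acknowledgement delay described above, which can occur even with the option set.

```python
import socket

# Create a TCP socket and disable Nagle's algorithm on it.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back; a non-zero value means it is active.
nodelay_active = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```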
Figure 6: Detail on performance degradation with small messages



However, if this problem affects the transmission of small
messages, it should also affect the Segmented Chain Broadcast
with any message size, as large messages are split into segments
of relatively small size. As the delay observed in Figure 6 is
not very evident in the case of the Segmented Chain, we believe
that this problem is also related to the management of the send
buffers. We think that the arrival of successive segments forces
the transmission of the messages, masking the undesirable
effects when messages are larger. We plan to answer this
question by investigating the segmented variations of the Flat
and the Binomial Broadcast, which, similarly to the Segmented
Chain, have to deal with small messages but send many more
messages than their traditional versions.



4. Personalised One-to-Many: Scatter 



The Scatter operation, which is also called "personalised
broadcast", is an operation where the root holds m × P data
items that should be equally distributed among the P processes,
including itself. As this is exactly the opposite of the Gather
primitive, once Scatter is modelled we have a good approximation
of the Gather model, which represents the "Many-to-One"
communication pattern.

In the case of Scatter, whose root holds a different message for
each process, it is believed that optimal algorithms for
homogeneous networks use flat trees, and for this reason the
Flat Tree approach is the default Scatter implementation in most
MPI implementations.

Actually, any alternative that performs Scatter by parallelising
the communications requires the transmission of large sets of
data to the auxiliary processes, because messages are not
identical. Taking for example the Binomial Tree, the root sends
down the tree "bulk" messages composed of subsets of the total
data. Because this strategy allows parallel sends, the
completion time could be reduced, but because the "bulk"
messages are larger than a simple message, they take more time
to be sent. Hence, the efficiency of the Binomial Scatter
strategy depends on how well the network deals with large
messages, and on how the trade-off between parallel sends and
transmission of large messages affects the completion time.

Table 2 presents the communication models we constructed for the
strategies presented above, and in this paper we compare Flat
Scatter and Binomial Scatter in real experiments. At first
glance, a Binomial Scatter is not as efficient as the Flat
Scatter, because each node receives from its parent node its own
message as well as the set of messages it must send to its
successors. On the other hand, the cost of sending these
"combined" messages (most of which is useless to the receiver
and must be forwarded again) may be compensated by the
possibility of executing parallel transmissions. As the
trade-off between transmission cost and parallel sends is
represented in our models, we can evaluate the advantages of
each strategy according to the cluster's characteristics.



Table 2: Communication models for Scatter

Strategy        Communication Model
Flat Tree       $(P-1) \times g(m) + L$
Chain           $\sum_{j=1}^{P-1} g(j \times m) + (P-1) \times L$
Binomial Tree   $\sum_{j=0}^{\lceil \log_2 P \rceil - 1} g(2^j \times m) + \lceil \log_2 P \rceil \times L$
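
The Scatter models can be evaluated in the same way as the Broadcast ones. The sketch below assumes the Chain model sums g(j × m) for j from 1 to P - 1 and the Binomial model sums g(2^j × m) over the ceil(log2 P) rounds; the linear `gap` function and `LATENCY` are placeholder assumptions, not measured parameters.

```python
from math import ceil, log2


def gap(m: float) -> float:
    """Hypothetical g(m) in seconds; replace with measured pLogP values."""
    return 50e-6 + m * 0.08e-6


LATENCY = 60e-6   # hypothetical L, in seconds


def flat_scatter(P, m, g=gap, L=LATENCY):
    # The root sends one message of size m to each of the other processes.
    return (P - 1) * g(m) + L


def chain_scatter(P, m, g=gap, L=LATENCY):
    # Each hop forwards the data still owed to the rest of the chain.
    return sum(g(j * m) for j in range(1, P)) + (P - 1) * L


def binomial_scatter(P, m, g=gap, L=LATENCY):
    # Combined messages of 2^j * m bytes, one per round.
    rounds = ceil(log2(P))
    return sum(g(2 ** j * m) for j in range(rounds)) + rounds * L
```

Because the Binomial cost depends only on the number of rounds, `binomial_scatter(17, m)` and `binomial_scatter(32, m)` are equal, while the Flat cost grows with every added process. Which strategy wins depends on how g grows with the message size on the actual network.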



4.1. Practical Results

In the case of Scatter, we compare the experimental results from
Flat and Binomial Scatters with the predictions from their
models. Due to our network characteristics, our experiments
showed that a Binomial Scatter can be more efficient than Flat
Scatter, a fact that is not usually exploited by traditional MPI
implementations. As a Binomial Scatter should balance the cost
of combined messages and parallel sends, it might occur, as in
our experiments, that its performance outweighs the "simplicity"
of the Flat Scatter, with considerable gains according to the
message size and number of nodes, as shown in Figures 7 and 8.
In fact, the Binomial Scatter performance depends on the number
of processes, which gives its characteristic "stair" shape,
while the Flat Tree model, limited by the time the root needs to
send successive messages to different nodes (the gap), follows a
more linear behaviour. The varying trade-off in the Binomial
Scatter algorithm encourages the use of our models to identify
which implementation is best suited to a specific environment
and set of parameters (message size, number of nodes), as shown
in Figure 9.

Nevertheless, Figure 9 shows that the models, especially in the
case of the Binomial Scatter, could not avoid a certain level of
imprecision. We believe that this difference is mostly due to
the manipulation of large amounts of data, which in the case of
the Binomial Scatter is heavily required due to the "combined"
messages the nodes receive and forward.

Figure 7: Real and expected performance for the Binomial Scatter
Figure 8: Real and expected performance for the Flat Scatter

Figure 9: Comparison between Scatter models and real results,
for 24 machines



5. Many-to-Many: All-to-All

The most intensive and one of the most important communication
patterns for scientific applications is the complete exchange,
or All-to-All. There are several concrete problems whose
parallel or distributed algorithms alternate periods of
computing with periods of data exchange among the processing
nodes, with a different message for each other process.
Actually, the All-to-All operation performs a transposition of
data stored across a set of processes, because every process
holds m × P data items that should be equally distributed among
the P processes, including itself.

There are many works that focus on the optimisation of
All-to-All and its variant All-to-All-v, where messages can have
arbitrary sizes. Most of these proposals are adapted only to
specific network structures, like meshes, toroids and hypercubes
[SI. General solutions, like those found in well known MPI
distributions, consider that each process engages in a
point-to-point communication with each other process, and
consequently the simplest algorithm for All-to-All is called
Direct Exchange, where all sends and receives are started
simultaneously.

An example of an implementation of the Direct Exchange algorithm
is the LAM 6.5.2 MPI_Alltoall [11]. A problem with this
algorithm, however, is that processes usually start
communication in the same order, and consequently may overload a
link by simultaneously sending messages to a single process in
each "round". Hence, a small optimisation consists of rotating
the communication order of each process, as now implemented in
both LAM 7.0.4 [E] and MPICH 1.2.5 [16]. In spite of this
optimisation, which avoids the overload of a specific process,
neither strategy minimises communication, and consequently
communication congestion is highly probable when the number of
nodes increases.

Thus, a major challenge in modelling the communication
performance of the All-to-All operation is the influence of
network contention. Models like those presented by [H] are
simply extensions of the Scatter model that do not take into
account the specificities of the All-to-All communication
pattern, nor the non-deterministic behaviour of network
contention.

Although non-deterministic behaviours are difficult to model, ^
introduced a simple means of accounting for contention in shared
networks, such as non-switched Ethernet, consisting of a
contention factor $\gamma$ that augments the linear
communication model T:

$$T = l + \frac{b \times \gamma}{W}$$



where $l$ is the link latency, $b$ is the message size, $W$ is
the bandwidth of the link, and $\gamma$ is equal to the number
of processes. Using this approach, they found that this simple
contention model greatly enhanced the accuracy of their
predictions for essentially zero extra effort.
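
As a small worked example of this contention factor, the sketch below assumes the augmented model has the form T = l + b × γ / W with γ set to the number of processes. The function name and the numeric values are illustrative only.

```python
def contended_time(latency: float, size: float, bandwidth: float, procs: int) -> float:
    """T = l + b * gamma / W, with gamma equal to the number of processes."""
    gamma = procs
    return latency + size * gamma / bandwidth


# 1 MB over a 12.5 MB/s (100 Mbps) shared link, illustrative latency of 0.1 ms.
alone = contended_time(1e-4, 1_000_000, 12.5e6, 1)
shared = contended_time(1e-4, 1_000_000, 12.5e6, 4)
```

Under this assumption the bandwidth term grows linearly with the number of processes that share the medium, while the latency term is unaffected.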

Similarly, we assume that contention is sufficiently linear to
be modelled. Our approach, however, consists of identifying the
performance bounds for the All-to-All operation, and deriving a
relation between these bounds that fits the experimental results
for the All-to-All operation. As this ratio depends on the
network characteristics, it is a "signature" of the network, and
therefore can be used in further predictions to obtain results
with considerable precision.

Our performance bounds were also defined as an extension to the
Scatter model, but they consider the main restrictions on
communication in the all-to-all pattern, especially the nodes'
capacity to overlap sends and receives. Indeed, we exploit the
fact that even if two messages cannot be sent consecutively in
less than g through the same link, it takes only $o_s$ to send a
message (more specifically, to deliver the message to the
network card) and $o_r$ to receive it. Consequently, the lower
bound represents the capability to access the network interface
as soon as the preceding send operation has returned, while in
the upper bound a node needs to serialise its transmissions due
to link contention. These two limits are presented in Table 3.

Table 3: Communication bounds for the All-to-All operation

Bound         Communication Model
Upper Bound   $(P-1) \times g(m) + (P-1) \times o_r(m) + L$
Lower Bound   $(P-1) \times o_s(m) + (P-1) \times o_r(m) + L$
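
A sketch of how the two bounds could be computed from pLogP parameters follows. The function name is hypothetical and the constant parameter functions are placeholders for measured g(m), o_s(m) and o_r(m).

```python
def all_to_all_bounds(P, m, g, o_s, o_r, L):
    """Return (lower, upper) predicted completion times for an All-to-All."""
    receive = (P - 1) * o_r(m)                 # cost of receiving P-1 messages
    lower = (P - 1) * o_s(m) + receive + L     # sends limited only by o_s
    upper = (P - 1) * g(m) + receive + L       # sends serialised by the gap
    return lower, upper


# Placeholder parameters (seconds), independent of m for simplicity.
lower, upper = all_to_all_bounds(
    P=24, m=1024,
    g=lambda m: 1e-4, o_s=lambda m: 2e-5, o_r=lambda m: 3e-5, L=6e-5,
)
```

The two bounds differ only in the send term, so their distance is (P - 1) × (g(m) - o_s(m)).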

5.1. Practical Results 

To illustrate our approach to representing the All-to-All
operation in an environment subject to network contention, we
present, in Figure 10, a comparison between the measured
performance of both the Direct Exchange algorithm and its
optimised version and the predicted performance bounds for a
group of 24 machines. It can be observed that both algorithms
behave almost identically, and that their performance differs
from the "Scatter-based" model (Lower bound) by a non-negligible
amount, which indicates the influence of network contention.

In fact, the analysis conducted by Grove [£| indicated that
"slow completion times were due to packet losses and their
associated TCP/IP retransmit timeout, caused by extreme network
load". Another fact that corroborates Grove's observations is
the similarity between the Direct Exchange and the Optimised
Direct Exchange performances (Figure 10). This result clearly
indicates that the contention in our experiments comes from the
network itself, and not from the overload of a specific machine.
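
The two variants compared above differ only in the order in which each process visits its peers. The sketch below contrasts the two orders under an idealised assumption that communication proceeds in synchronous rounds; real implementations start all transfers at once, so this only illustrates why the rotated order avoids overloading a single destination. The function names are hypothetical.

```python
def direct_order(rank, P):
    """Every process visits its peers in the same order 0, 1, ..., P-1."""
    return [dest for dest in range(P) if dest != rank]


def rotated_order(rank, P):
    """Process `rank` starts with its right neighbour and wraps around."""
    return [(rank + step) % P for step in range(1, P)]


def busiest_target(order, P):
    """Largest number of processes targeting one destination in a single round."""
    worst = 0
    for rnd in range(P - 1):
        hits = {}
        for rank in range(P):
            dest = order(rank, P)[rnd]
            hits[dest] = hits.get(dest, 0) + 1
        worst = max(worst, max(hits.values()))
    return worst
```

With 8 processes, the unrotated order has 7 processes targeting process 0 in the first round, while the rotated order never has more than one sender per destination in a round. As the measurements indicate, this removes the overload of a single machine but not the contention in the network itself.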

Therefore, we were able to determine a ratio between the
predicted Upper and Lower bounds that provides good predictions
of the performance of the All-to-All operation. This contention
ratio $\gamma$ is constant and depends only on the network
characteristics, whilst the Lower and Upper bounds depend on the
number of processes, giving a predicted performance of:

$$T = \mathit{Lower} + (\mathit{Upper} - \mathit{Lower}) \times \gamma$$
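
The sketch below evaluates this interpolation and shows one possible way to estimate the contention ratio from measurements. The least-squares fit is an assumption made for the example, not necessarily the procedure we followed, and the sample values (generated with a ratio of 0.4) are purely illustrative.

```python
def predict(lower: float, upper: float, gamma: float) -> float:
    """T = Lower + (Upper - Lower) * gamma."""
    return lower + (upper - lower) * gamma


def fit_gamma(samples):
    """Least-squares gamma from (lower, upper, measured) triples."""
    num = sum((t - lo) * (up - lo) for lo, up, t in samples)
    den = sum((up - lo) ** 2 for lo, up, _ in samples)
    return num / den


# Illustrative triples only: (lower bound, upper bound, measured time).
samples = [(1.0, 2.0, 1.4), (2.0, 4.0, 2.8)]
gamma = fit_gamma(samples)
```

Once a ratio has been fitted for a network, `predict` can be applied to the bounds computed for other message sizes and numbers of processes.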

As a result of our practical experiments, the contention ratio
that best represents the characteristics of our network was
assumed to be $\gamma$ = |. The predicted performances fit most
of the observed results, with a small variation only in the case
of small messages, which are also subject to the TCP
acknowledgement problem discussed in Section 3.1.

This way, despite the non-deterministic behaviour of 
the network contention, we adopted a linear approach 
where a constant factor, characteristic to each network, 
allows the generation of accurate prediction results. 
6. Conclusions and Future Works

Existing works that explore the optimisation of heterogeneous
networks usually focus only on the optimisation of inter-cluster
communication. We do not agree with this approach, and we
suggest optimising both inter-cluster and intra-cluster
communication.

For instance, in this paper we propose the use of performance
models to decide, among well known techniques for collective
communication, which is best suited to a specific set of
parameters (number of processes, message size).

As our approach suggests the use of communication models to
allow a fast performance prediction, its accuracy needed to be
validated. Consequently, in this paper we presented three cases
that compare the models' predicted performances and the real
results for three collective communication patterns:
"one-to-all", "personalised one-to-all" and "many-to-many". We
verified that the models we constructed were accurate enough to
predict the performance of the collective communications, and to
allow the selection of the implementation strategy best suited
to our network.

For the modelling of the All-to-All operation, we chose to
represent the effects of network contention as a linear factor.
Although our experiments demonstrate that linear assumptions
were accurate enough to predict the performance of this
operation, we agree that this approach does not cover all
possibilities in a real environment. Even so, the results
presented in this work offer many clues for future
investigations on the modelling of communication operations
subject to non-deterministic network contention.

In parallel, we will continue our research on grid-aware
collective communications. We wish to evaluate the accuracy of
our models with other network interconnects, like Myrinet, and
we are especially interested in the automatic organisation of
multi-level collective communications. Hence, our final
objective is to integrate both performance prediction and
wide-area communication optimisation in a highly automated
collective communication library for grid environments.